Conversation

@amametjanov (Member)

Also remove (slow) profiling calls.

[BFB]


The hangs are not reproducible with a plain-vanilla --compset F2010-SCREAMv1 --res ne256pg2_ne256pg2 case: the caar-loop dp3d limiter needs to be triggered. If triggered, the hang can occur within Lm1, so at least one model-month needs to be run.

@amametjanov self-assigned this Nov 2, 2025
@amametjanov added the bug fix PR, BFB (PR leaves answers BFB), HOMME, and Aurora (ALCF machine Aurora) labels Nov 2, 2025
@amametjanov (Member Author)

Testing on Aurora with run.ne256pg2_ne256pg2.F2010-SCREAMv1.sh:

  • base: hang after ~16 model-days, on YYYYMMDD 20180114
  • test: continues past the hang at ~1.41 sypd

@amametjanov (Member Author) left a comment


It'd be good to remove the two printf calls from the GPU kernel.

Kokkos::printf("WARNING:CAAR: dp3d too small. k=%d, dp3d(k)=%f, dp0=%f \n",
k+1,dp_as_real(k),dp0_as_real(k));
}
#endif
@amametjanov (Member Author)


Also, I'd like to disable these prints inside GPU kernels and print them only in debug mode, i.e.

+#ifndef NDEBUG
 #ifndef HOMMEXX_BFB_TESTING
         if(diff_as_real(k) < 0){
           Kokkos::printf("WARNING:CAAR: dp3d too small. k=%d, dp3d(k)=%f, dp0=%f \n",
                          k+1,dp_as_real(k),dp0_as_real(k));
         }
+#endif
 #endif  

Is that okay?

@mahf708 (Contributor), Nov 3, 2025


I think removing these from the kernels is a good idea. But we do want to give the user some warning about this: these warnings are very helpful when debugging instability issues (still a relatively common problem in F2010-SCREAMv1-variety setups). I think @mt5555 should also be consulted before removing these warnings completely.

One idea might be to capture/set a flag inside the kernel; then outside the kernel, if it's true (i.e., the condition was triggered), we print a statement. The specifics (levels and values) are helpful, but we could live with a simple warning (just saying the threshold was triggered) without the specifics if we must.
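
A hedged, minimal sketch of this flag-capture idea (names like diff and warn, and the level count, are hypothetical stand-ins, not the HOMME code): the kernel records the condition in a 0-d device flag, and the host checks it and prints.

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nlev = 72;                       // assumed number of levels
    Kokkos::View<double*> diff("diff", nlev);  // stand-in for diff_as_real
    Kokkos::View<int> warn("warn");            // 0-d device flag, zero-initialized
    Kokkos::parallel_for("check_dp3d", nlev, KOKKOS_LAMBDA(const int k) {
      if (diff(k) < 0) Kokkos::atomic_store(&warn(), 1);  // condition triggered
    });
    // Copy the flag to the host and print the warning outside the kernel.
    auto warn_h = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), warn);
    if (warn_h() != 0)
      std::printf("WARNING:CAAR: dp3d below threshold this step\n");
  }
  Kokkos::finalize();
  return 0;
}

This loses the per-level specifics, as noted above, but keeps the kernel itself free of printf.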

@bartgol (Contributor), Nov 3, 2025


A || reduce is definitely a reasonable strategy. It is already implemented in the CaarFunctor, where we want to print an extensive picture of the state before crashing, if something is amiss in the functor execution.

Edit: I would also be against making the code print in debug mode only. For relatively large cases, people are not so likely to run the code in debug mode, as it would be remarkably slower (Kokkos is MUCH slower in debug mode), making debugging painful.
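
For reference, a hedged sketch of what such a team-level ||-reduce can look like (hypothetical names and sizes, not the actual CaarFunctor code): reduce a "something is amiss" flag over the team, then let one thread per team print.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    const int nelem = 4, nlev = 72;            // assumed sizes
    Kokkos::View<double**> dp3d("dp3d", nelem, nlev);
    Kokkos::parallel_for(policy_t(nelem, Kokkos::AUTO),
                         KOKKOS_LAMBDA(const member_t& team) {
      const int ie = team.league_rank();
      int bad = 0;
      // ||-reduce: true if any level in this element trips the threshold.
      Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, nlev),
                              [&](const int k, int& flag) {
        if (dp3d(ie, k) < 0) flag = 1;
      }, Kokkos::LOr<int>(bad));
      // One print per team instead of one per offending level.
      Kokkos::single(Kokkos::PerTeam(team), [&]() {
        if (bad) Kokkos::printf("WARNING: dp3d too small in element %d\n", ie);
      });
    });
  }
  Kokkos::finalize();
  return 0;
}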

vtheta_dp(ilev) *= dp(ilev);
});
} //end of min_diff < 0
kv.team_barrier();
@amametjanov (Member Author), Nov 2, 2025


Okay to disable this print too, i.e. print only in debug mode?

         for (int ivec=0; ivec<VECTOR_SIZE; ++ivec) {
           if ( (vtheta_dp(ilev)[ivec] - m_vtheta_thresh*dp(ilev)[ivec]) < 0) {
+#ifndef NDEBUG
 #ifndef HOMMEXX_BFB_TESTING
             Kokkos::printf("WARNING:CAAR: k=%d,theta(k)=%f<%f=th_thresh, applying limiter \n",
                            ilev*VECTOR_SIZE+ivec+1,vtheta_dp(ilev)[ivec]/dp(ilev)[ivec],m_vtheta_thresh);
+#endif
 #endif
              vtheta_dp(ilev)[ivec]=m_vtheta_thresh*dp(ilev)[ivec];
           }

#endif
result = result<=diff_as_real(k) ? result : diff_as_real(k);
}, reducer);
kv.team_barrier();
@amametjanov (Member Author)


This barrier is needed to finish updating min_diff, which is used below.

Contributor


As I said in the previous comment, I think barriers inside a TeamThreadRange are conceptually wrong. They work in our case b/c our team sizes are <=16 and always a power of 2, and NP*NP=16. But if this were to change in the future, I think these barriers may cause deadlocks. Barriers should be outside TeamThreadRange loops.

We can help fix this if you want. We can see if Oksana has some commit buried somewhere that addresses these barriers.
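
A hedged sketch of the restructuring meant here, with hypothetical sizes (not the HOMME loops): split the work into two top-level TeamThreadRange loops and put the team_barrier between them, where every team thread reaches it regardless of team size.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    const int nelem = 4, np2 = 16;             // assumed sizes
    Kokkos::View<double**> a("a", nelem, np2), b("b", nelem, np2);
    Kokkos::parallel_for(policy_t(nelem, Kokkos::AUTO),
                         KOKKOS_LAMBDA(const member_t& team) {
      const int ie = team.league_rank();
      // Phase 1: every team thread writes part of a(ie,:).
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, np2),
                           [&](const int i) { a(ie, i) = i; });
      team.team_barrier();  // top level: safe for any team size
      // Phase 2: read entries of a(ie,:) written by other threads.
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, np2),
                           [&](const int i) { b(ie, i) = a(ie, np2 - 1 - i); });
    });
  }
  Kokkos::finalize();
  return 0;
}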

Contributor


I second @bartgol, these should not be here. I'm curious if parallel_reduce(team) implies a sync over team threads though, similar to how parallel_reduce(exec_space) implies a fence over the execution space.

@ambrad (Member), Nov 4, 2025


Some background: we started adding these due to GPU race conditions on various target machines, knowing they weren't quite right for arbitrary threading configurations. The commit I linked does the right thing by keeping the syncs but rewriting loops so syncs occur at the top level.

Comment on lines -171 to -172
kv.team_barrier();

@amametjanov (Member Author)


Moving this barrier out of the if (min_diff<0) { branch, to avoid possible stalls waiting for threads in the else branch.
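
A hedged before/after sketch of that change (hypothetical data; the branch condition is uniform across the team here, as in the real code): the barrier moves from inside the branch to just after it, so all team threads arrive at the same barrier on either path.

#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    const int nelem = 4, np2 = 16;             // assumed sizes
    Kokkos::View<double*> min_diff("min_diff", nelem);
    Kokkos::View<double**> v("v", nelem, np2);
    Kokkos::parallel_for(policy_t(nelem, Kokkos::AUTO),
                         KOKKOS_LAMBDA(const member_t& team) {
      const int ie = team.league_rank();
      if (min_diff(ie) < 0) {
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, np2),
                             [&](const int i) { v(ie, i) = 0.0; });
        // (was: team.team_barrier() here, inside the branch)
      }
      team.team_barrier();  // moved out: reached on both paths
      // ...subsequent reads of v(ie,:) are now safe...
    });
  }
  Kokkos::finalize();
  return 0;
}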

Member


This is a tricky issue. It's definitely wrong to have a team_barrier in a conditional like this, since it can lead to deadlock. But if NUM_LEV != NUM_PHYSICAL_LEV, I think there can be a race condition in the parallel_for-parallel_reduce sequence that this team_barrier is separating. On the GPU, NUM_LEV == NUM_PHYSICAL_LEV, and on the CPU, we don't actually have threading here in practice, so this point is not relevant in practice.

Nonetheless, I had a branch going a long time ago that fixed this type of issue in a few spots in HOMME, but I ended up getting distracted by other machine issues and never got this branch in. It might be worth consulting this commit to see how to keep this team_barrier. (The ComposeTransport change is no longer needed, but the three others are still relevant, although again only in threading/vectorization configurations we don't run in practice.)

The changes in this PR are fine if keeping them is preferred.

Contributor


Andrew, I think we should get your commit in (at least, the part that fixes the team barrier). While I agree that in practice it doesn't make a difference now, we don't know if our threading schemes will change five years from now, and catching this kind of bug would take time. Better to put in the correct code now, and be safe later.

Contributor


Btw, I thought @oksanaguba had fixed this as well in one of her branches a while ago, no?

Member


> I think we should get your commit in

Ok. For clarity, I'm not on this project this year, so someone else would need to bring in the changes.

vtheta_dp(ilev) *= dp(ilev);
});
} //end of min_diff < 0
kv.team_barrier();
@amametjanov (Member Author), Nov 2, 2025


Adding kv.team_barrier() here to finish updating vtheta_dp, which is read below.

@mahf708 requested a review from mt5555 on November 3, 2025
@ambrad (Member) left a comment


This is an improvement and fixes an important problem, so I see no reason to make further changes in this PR despite the good ideas for more changes. Those can go in a separate PR. Just my opinion: don't necessarily take this approval as definitive.

diff(ilev) = (dp(ilev) - m_dp3d_thresh*dp0(ilev))*spheremp;
});

kv.team_barrier();
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, I think this team barrier is also wrong, as it sits inside a TeamThreadRange. While in practice it's ok, since our team sizes always divide NP*NP, if someone somehow launched with a team_size of, say, 10, I think this may cause a deadlock.

Contributor


I do believe @oksanaguba had run into these barriers at some point, and fixed them. Perhaps the fixes never made it to master. Oksana, do you recall anything like that? Do you know which branch may have all fixes for this stuff? We can of course fix them here, but I want to make sure we don't leave any issue behind, while we're at it...

Member


Luca, I agree about the barriers. The commit I linked handles that issue, as well.

Contributor


@amametjanov if it's ok with you, I can extract the relevant part from Andrew's commit, and push to this branch.

@amametjanov (Member Author)


Yes, please feel free to update as needed!

